Speech synthesizer



June 27, 1967 J. L. KELLY, JR

SPEECH SYNTHESIZER

United States Patent 3,328,525

John L. Kelly, Jr., Berkeley Heights, N.J., assignor to Bell Telephone Laboratories, Incorporated, New York, N.Y., a corporation of New York

Filed Dec. 30, 1963, Ser. No. 334,354. 2 Sheets. 4 Claims. (Cl. 179-1)

This invention relates to the synthesis of speech, and in particular to the synthesis of natural sounding speech in bandwidth compression systems.

Conventional speech communication systems, for example, commercial telephone systems, typically convey human speech by transmitting an electrical facsimile of the acoustic waveform produced by a human talker. Because of the redundancy of human speech, however, facsimile transmission is a relatively inefficient way to transmit speech information, and it is well known that the information contained in a typical speech sound may be transmitted over a channel of substantially narrower bandwidth than that required for facsimile transmission of the speech waveform.

A number of arrangements for compressing or reducing the amount of bandwidth employed in the transmission of speech information have been proposed, one of the best known of these arrangements being the so-called resonance vocoder. A specific version of the resonance vocoder is described in J. C. Steinberg Patent 2,635,146, issued Apr. 14, 1953.

The distinctive feature of resonance vocoder systems is the transmission of speech information in terms of narrow bandwidth control signals representative of the frequency locations of selected peaks or maxima in the speech amplitude spectrum which correspond to the principal formants or resonances of the human vocal tract. A typical resonance vocoder system includes at a transmitter station an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals including formant control signals representative of the frequencies of selected formant peaks in the speech spectrum. After transmission to a receiver station, the control signals are applied to a synthesizer that is provided with controllable resonant circuits for shaping an artificial spectrum to have peaks at frequencies specified by the formant control signals, thereby reconstructing a replica of the spectrum of the original speech wave.

From the standpoint of efficiency, it is of course desirable to transmit the smallest number of control signals consistent with a desired level of intelligibility and naturalness in the reconstructed speech wave. It is well known that the first three formants in order of frequency contribute most to the intelligibility of speech; accordingly, it is common practice to transmit three formant control signals representative of the locations of these three principal formants. From the standpoint of speech quality, however, it has been found that higher frequency formants contribute significantly to the naturalness of reconstructed speech, but of course the transmission of additional formant control signals requires a greater transmission channel bandwidth, thereby decreasing the bandwidth efficiency of a resonance vocoder system.

One arrangement that improves the quality of reconstructed speech without decreasing bandwidth efficiency is described in an article by J. L. Flanagan and A. S. House, "Development and Testing of a Formant-Coding Speech Compression System," volume 28, Journal of the Acoustical Society of America, page 1099 (1957). In the Flanagan-House system, the three principal lower order formants are represented by control signals that are transmitted from an analyzer to a synthesizer, as in a conventional resonance vocoder, but in addition to shaping an artificial spectrum to have three peaks at frequencies specified by the three transmitted formant control signals, the synthesizer is provided with a separate fixed frequency resonant circuit that shapes the artificial spectrum to have a fourth peak at a fixed frequency corresponding to the average location of a fourth, high frequency formant. However, the human vocal tract is characterized by an infinite number of resonances or formants, and therefore the Flanagan-House arrangement does not specify completely the higher order speech formants.

The present invention improves the quality of speech reconstructed in a resonance vocoder without decreasing bandwidth efficiency by providing at a resonance vocoder receiver station a novel arrangement for shaping an artificial spectrum to have an infinite number of peaks at selected fixed frequencies corresponding to the locations of higher order speech formants. In the apparatus of this invention there is provided a novel resonant circuit having an infinite number of resonances at selected fixed frequencies, the higher frequency resonances of this resonant circuit corresponding to the frequencies of higher order speech formants. One or more of the lower frequency resonances of this circuit may lie within the frequency range of the formants represented by transmitted control signals, hence these lower frequency resonances are canceled or removed by separate antiresonant circuits having antiresonances at fixed frequencies corresponding to the unwanted lower frequency resonances of the resonant circuit. In the present invention an artificial spectrum is first shaped in conventional fashion by adjustable resonant circuits to have lower order formant peaks at frequencies specified by transmitted formant control signals, following which the artificial spectrum is further shaped by the apparatus of the present invention to have higher order formant peaks at fixed frequencies specified by the uncanceled resonances of the resonant circuit provided in this invention.

The invention will be fully understood from the following detailed description of illustrative embodiments thereof, taken in connection with the appended drawings, in which:

FIG. 1 is a block diagram showing a complete resonance vocoder system embodying the principles of this invention;

FIG. 2 is a circuit diagram showing in detail a specific embodiment of an antiresonant circuit employed in this invention; and

FIGS. 3A and 3B are diagrams of assistance in explaining the features of this invention.

Referring first to FIG. 1, elements 11 and 12 respectively represent the analyzer and synthesizer portions of a typical formant vocoder system. Formant vocoder analyzer 11, which is ordinarily located at a transmitter station, includes a pitch detector 111, a voiced amplitude detector 112, and a formant frequency detector 113, which respectively derive from a speech wave from source 10, for example, a conventional microphone, a group of narrow bandwidth control signals respectively representative of the fundamental glottal excitation frequency, $F_0$, the amplitude of the glottal excitation, $A_V$, and the frequencies $F_1, \ldots, F_N$, $N = 2, 3, \ldots$, of selected maxima in the spectrum of the incoming speech wave. The frequency locations of these maxima in the speech spectrum correspond to natural frequencies of the formants or normal modes of vibration of the human vocal tract, and as the shape of the vocal tract is deformed during the articulation of different speech sounds, the natural or formant frequencies and the corresponding locations of spectral maxima also change. It is generally accepted that the most important formants are the three having the lowest frequencies, and in most formant vocoders, the formant control signals derived by an analyzer represent the first three formants; that is, three is a suitable value for $N$ in the apparatus of FIG. 1.

The control signals derived by analyzer 11 are delivered by way of a suitable transmission medium, indicated by broken lines, to a synthesizer 12 located at a receiver station. In synthesizer 12 there is provided a buzz source 121 that generates a relatively flat artificial amplitude spectrum comprising a plurality of relatively uniform amplitude harmonics of the fundamental frequency $F_0$, and an amplitude modulator 122 that adjusts the uniform amplitude of the harmonics of the artificial spectrum from source 121 to represent the glottal excitation amplitude $A_V$. Synthesizer 12 is further provided with a cascade of uncoupled resonant circuits 123-1 through 123-N, each of which has an adjustable resonance that is individually tuned by a corresponding one of the $N$ formant control signals transmitted from analyzer 11. By successively passing the artificial spectrum of uniform amplitude harmonics from modulator 122 through circuits 123-1 through 123-N, the spectrum is shaped by the resonances of circuits 123-1 through 123-N to resemble the spectrum of the original speech wave from source 10. Thus the adjustable resonances of circuits 123-1 through 123-N shape the spectrum from modulator 122 to have $N$ maxima at frequencies $F_1$ through $F_N$ corresponding to the locations of the $N$ formant peaks in the original speech spectrum which are represented by the $N$ transmitted formant control signals.

It is evident that the synthesized spectrum developed at the output terminal of synthesizer 12 is limited in its resemblance to the original speech spectrum in that the number of peaks in the synthesized spectrum is determined by the number of transmitted formant control signals. Thus in the usual situation where three formant control signals representative of the three principal speech formants are transmitted, the synthesized spectrum developed by synthesizer 12 has only three maxima. A closer resemblance to the original speech spectrum is obtained in the present invention by passing the synthesized spectrum from synthesizer 12 through higher order formant synthesizer 13 in order to shape further the synthesized spectrum to have additional maxima at frequencies corresponding to the locations of maxima in the original speech spectrum which are not represented by transmitted control signals.

Within synthesizer 13, the synthesized spectrum from synthesizer 12 is passed through $M$ series-connected antiresonant circuits 131-1 through 131-M to resonant circuit 133, which comprises a delay line 133a of length $\tau$ seconds in negative feedback relation through subtractor 132 with an amplifier 133b having a gain $e^{-\sigma\tau}$ less than unity. As described in detail below, resonant circuit 133 theoretically has an infinite number of resonances at fixed frequencies dependent upon the length $\tau$ of delay line 133a. The fixed resonances of feedback circuit 133 shape the incoming synthesized spectrum from synthesizer 12 to have maxima which correspond to desired higher order formant locations not specified by the transmitted formant control signals. Since feedback circuit 133 has fixed resonances at low frequencies as well as at high frequencies, and since the synthesized spectrum from synthesizer 12 has already been shaped to have $N$ maxima corresponding to $N$ low frequency formants, one or more of the low frequency resonances of feedback circuit 133 must be canceled. Cancellation of $M$ selected low frequency resonances of feedback circuit 133 is accomplished by the $M$ antiresonant circuits 131-1 through 131-M, $M = 1, 2, \ldots$, a detailed illustration of a suitable antiresonant circuit being shown in FIG. 2 and explained below. After the synthesized spectrum from synthesizer 12 has been further shaped by synthesizer 13, the spectrum may be converted into intelligible speech sounds by a suitable transducer 14, for example, a conventional loudspeaker.

Before describing the theoretical considerations underlying the construction of resonant circuit 133, it will be helpful at this point to describe the properties of the human vocal tract during vowel production in terms of Laplace transform notation. The ratio of the Laplace transform, $U_2(s)$, of the volume velocity of air through the lips, $u_2(t)$, to the Laplace transform, $U_1(s)$, of the volume velocity of air through the glottis, $u_1(t)$, this ratio being commonly known as the transfer characteristic of the vocal tract, can be approximated by a rational transfer function having the following form:

$$\frac{U_2(s)}{U_1(s)} = \prod_{k} \frac{s_k s_k^*}{(s - s_k)(s - s_k^*)} \qquad (1)$$

where $s$ is the complex frequency variable; $s_k = (-\sigma_k + j\omega_k)$ is a complex number representing a formant or normal mode of vibration of the vocal tract; and $s_k^*$ is the complex conjugate of $s_k$. Equation 1 indicates that the glottis-to-mouth transfer function for vowel sounds has only poles, denoted $s_k$, and no zeros, and that the poles coincide with the normal modes of vibration. The locations of these poles are illustrated graphically in the pole diagram of FIG. 3A, where the X's indicate the locations in the s-plane of the complex numbers $s_1$, $s_2$, $s_3$, representing the first three poles or formants, and the dashed lines indicate higher frequency poles.
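As a rough illustration of the all-pole description above, the following Python sketch evaluates a product of conjugate pole pairs along the frequency axis. The formant frequencies (500, 1500, and 2500 cycles per second), the common damping of 400 nepers per second, the normalization to unity at zero frequency, and all function names are assumptions chosen for illustration; they are not taken from the drawings.

```python
import math

def formant_pole(freq_hz, sigma=400.0):
    """Upper-half-plane pole s_k = -sigma + j*2*pi*f (sigma is an assumed damping)."""
    return complex(-sigma, 2.0 * math.pi * freq_hz)

def all_pole_response(s, poles):
    """Evaluate prod_k s_k s_k* / ((s - s_k)(s - s_k*)) at complex frequency s."""
    h = complex(1.0, 0.0)
    for sk in poles:
        sk_conj = sk.conjugate()
        h *= (sk * sk_conj) / ((s - sk) * (s - sk_conj))
    return h

def magnitude_at(freq_hz, poles):
    """Magnitude of the response on the j-omega axis at freq_hz."""
    return abs(all_pole_response(complex(0.0, 2.0 * math.pi * freq_hz), poles))

# Hypothetical formant frequencies, for illustration only.
POLES = [formant_pole(f) for f in (500.0, 1500.0, 2500.0)]

if __name__ == "__main__":
    for f in range(250, 3001, 250):
        print(f"{f:5d} Hz  |H| = {magnitude_at(f, POLES):6.2f}")
```

Printing the table shows maxima near the three assumed formant frequencies and valleys between them, which is the spectral shaping that an all-pole characteristic with no zeros produces.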

Since the vocal tract is a distributed acoustic system, it has in theory an infinite number of natural frequencies which change in value with time as the vocal tract is deformed during the articulation of different speech sounds, and correspondingly, the poles of the transfer characteristic in Equation 1 also change with time. However, only at relatively low frequencies, usually including no more than the first three natural frequencies, do the natural frequencies change substantially in value with deformations of the vocal tract, whereas at high frequencies, the natural frequencies asymptotically approach a uniform spacing in frequency.

Synthesizer 13 of the present invention is provided with an infinite number of natural resonant frequencies having a uniform spacing in frequency corresponding to that of the higher order natural frequencies of the vocal tract by constructing resonant circuit 133 and antiresonant circuits 131-1 through 131-M in the following manner. As shown in FIG. 1, resonant circuit 133 comprises delay line 133a of length $\tau$ in feedback relation through subtractor 132 with amplifier 133b having gain $e^{-\sigma\tau}$, where a suitable value for $e^{-\sigma\tau}$ may be on the order of 2/3. In Laplace transform notation, the transfer characteristic of delay line 133a is $e^{-s\tau}$, hence the combined transfer characteristic of delay line 133a and amplifier 133b is the product

$$e^{-s\tau}\, e^{-\sigma\tau} = e^{-(s+\sigma)\tau}. \qquad (2)$$

Since the elements 133a and 133b are in negative feedback relation with each other through subtractor 132, the product in Equation 2 corresponds to the familiar loop gain in the transfer characteristic relating the incoming signal $F_1$ applied to the minuend terminal of subtractor 132 to the outgoing signal $F_2$ developed at the output terminal of circuit 133,

$$\frac{F_2}{F_1} = \frac{1}{1 + e^{-(s+\sigma)\tau}}. \qquad (3)$$

The poles of $F_2/F_1$ occur at those values of $s$ for which the denominator of Equation 3 vanishes,

$$e^{-(s+\sigma)\tau} = -1, \qquad (4)$$

and since $e^{\pm j\pi} = -1$ and the complex exponential is periodic, the positive frequency poles of $F_2/F_1$ occur at odd integral multiples of $\pi/\tau$,

$$\tau(s_n + \sigma) = \pm j n\pi, \qquad (5)$$

$$s_n = -\sigma \pm j\frac{n\pi}{\tau}, \quad (n = 1, 3, 5, \ldots). \qquad (5a)$$

From Equation 5a it is evident that $F_2/F_1$ has an infinite number of poles, the radian frequency of the nth pole being

$$\omega_n = \frac{n\pi}{\tau}, \qquad (5b)$$

or, since $\omega_n = 2\pi f_n$, where $f_n$ denotes frequency in cycles per second,

$$f_n = \frac{n}{2\tau}. \qquad (5c)$$

The spacing in frequency between poles is uniform, being $2\pi/\tau$ in radians per second and $1/\tau$ in cycles per second. Further, each pole has the same constant real part, $-\sigma$, which represents the so-called formant damping, and which is manifested in the speech spectrum by the bandwidth of the formant peaks.
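The pole locations described above can be checked numerically. The sketch below assumes that the delay line and amplifier in negative feedback give a characteristic whose denominator is $1 + e^{-(s+\sigma)\tau}$, and confirms that this denominator vanishes at $s = -\sigma + jn\pi/\tau$ for odd $n$. The function names and the values $\tau = 1$ millisecond and $\sigma = 400$ nepers per second are illustrative choices.

```python
import cmath
import math

def comb_pole(n, tau, sigma):
    """Upper-half-plane pole s_n = -sigma + j*n*pi/tau, defined for odd n only."""
    if n % 2 == 0:
        raise ValueError("poles occur only at odd n")
    return complex(-sigma, n * math.pi / tau)

def denominator(s, tau, sigma):
    """Assumed feedback denominator 1 + exp(-(s + sigma) * tau)."""
    return 1.0 + cmath.exp(-(s + sigma) * tau)

def pole_frequency_hz(n, tau):
    """Frequency of the nth pole in cycles per second, f_n = n / (2 * tau)."""
    return n / (2.0 * tau)

TAU = 1.0e-3     # delay in seconds (illustrative)
SIGMA = 400.0    # damping in nepers per second (illustrative)

if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        s_n = comb_pole(n, TAU, SIGMA)
        residual = abs(denominator(s_n, TAU, SIGMA))
        print(n, pole_frequency_hz(n, TAU), residual)
```

The residual printed for each odd $n$ is at the level of floating-point rounding, and successive pole frequencies differ by $1/\tau$.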

FIG. 3B illustrates graphically the locations of the poles of $F_2/F_1$, in which it is noted that a particular uniform spacing in frequency may be obtained by suitably choosing the length $\tau$ of delay line 133a and the corresponding factor $\sigma$ in the gain of amplifier 133b. A suitable value for the length $\tau$ of delay line 133a may be on the order of one millisecond, which corresponds to the round trip delay of the human vocal tract, thereby placing the poles of feedback circuit 133 at frequencies of approximately 500, 1500, 2500, ... cycles per second. Similarly, a suitable value for the formant bandwidth factor $\sigma$ in the gain of amplifier 133b may be on the order of 400 nepers per second, corresponding to a formant bandwidth of about 130 cycles per second.
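The numerical values above can be explored with a discrete-time stand-in for the delay-line feedback circuit. This is only a sketch under stated assumptions: the patent describes an analog delay line, whereas the code samples at an assumed 16 kHz so that a 16-sample delay equals one millisecond, takes the output at the subtractor, and uses hypothetical function names.

```python
import cmath
import math

def comb_impulse_response(delay_samples, gain, num_samples):
    """Impulse response of y[n] = x[n] - gain * y[n - delay_samples]."""
    y = [0.0] * num_samples
    for n in range(num_samples):
        x = 1.0 if n == 0 else 0.0
        fed_back = gain * y[n - delay_samples] if n >= delay_samples else 0.0
        y[n] = x - fed_back
    return y

def magnitude_at(freq_hz, h, sample_rate):
    """Magnitude of the discrete-time Fourier transform of h at freq_hz."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    return abs(sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(h)))

SAMPLE_RATE = 16000                  # assumed sampling rate, not from the patent
DELAY = 16                           # 16 samples at 16 kHz = 1 ms delay
GAIN = math.exp(-400.0 * 1.0e-3)     # e^{-sigma*tau}, roughly 2/3

# 200 round trips are enough for the decaying response to become negligible.
H = comb_impulse_response(DELAY, GAIN, DELAY * 200)

if __name__ == "__main__":
    for f in (500, 1000, 1500, 2000, 2500):
        print(f, round(magnitude_at(f, H, SAMPLE_RATE), 3))
```

Under these assumptions the response peaks at odd multiples of 500 cycles per second, where the magnitude approaches $1/(1 - e^{-\sigma\tau})$, and dips midway between the peaks, where it approaches $1/(1 + e^{-\sigma\tau})$.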

Depending upon the value selected for the length of delay line 133a, one or more of the fixed poles or resonances of resonant circuit 133 may occur within the frequency range of formants represented by the control signals transmitted from analyzer 11 to synthesizer 12. For example, by selecting the length of delay line 133a to be one millisecond, the first three fixed poles of circuit 133 occur at 500, 1500, and 2500 cycles per second, which respectively lie within the frequency ranges of the first three speech formants typically represented by transmitted control signals. In this situation it is therefore necessary to remove or cancel the three lower order poles of resonant circuit 133 which lie within the frequency range of the formants represented by transmitted control signals in order to prevent interference with the maxima previously synthesized in the artificial spectrum from synthesizer 12. FIG. 3B illustrates the situation in which the first three poles of circuit 133 are to be canceled, as indicated by the three X's enclosed in circles.

Antiresonant circuits 131-1 through 131-M, which precede feedback circuit 133, are designed to cancel $M = 1, 2, \ldots$ unwanted lower order poles of feedback circuit 133 in the following manner. In order for a particular antiresonant circuit, say 131-1, to cancel a corresponding one of the poles of resonant circuit 133, say the pole denoted $s_1 = (-\sigma_1 + j\omega_1)$, it is necessary for the transfer characteristic of circuit 131-1 to have a single zero at $s_1 = (-\sigma_1 \pm j\omega_1)$ and no other poles or zeros. That is, if $Z_1$ denotes the transfer characteristic of circuit 131-1, then $Z_1$ must be proportional to $(s - s_1)(s - s_1^*)$,

$$Z_1 = A\,(s - s_1)(s - s_1^*), \qquad (6)$$

where $A$ is a constant. A suitable realization of a circuit having a transfer characteristic of the type specified by Equation 6 is shown in detail in FIG. 2, it being understood that other antiresonant circuits having a suitable transfer characteristic may be employed, if desired. Further, it is understood that the transfer characteristic required for the cancellation of other unwanted poles by antiresonant circuits 131-2, ..., 131-M may be obtained by substituting other poles $s_2, s_2^*; \ldots; s_M, s_M^*$ for the quantities $s_1, s_1^*$ in Equation 6.
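The cancellation argument can be illustrated numerically. The sketch below assumes the feedback circuit has the characteristic $1/(1 + e^{-(s+\sigma)\tau})$ and that the antiresonant section contributes the factor $(s - s_1)(s - s_1^*)$ with the constant $A$ taken as unity; the names and parameter values are illustrative. Near $s_1$ the feedback characteristic alone grows without bound, while the cascade approaches the finite value $2j\omega_1/\tau$.

```python
import cmath
import math

TAU = 1.0e-3     # delay in seconds (illustrative)
SIGMA = 400.0    # damping in nepers per second (illustrative)

def comb(s):
    """Assumed feedback characteristic 1 / (1 + exp(-(s + sigma) * tau))."""
    return 1.0 / (1.0 + cmath.exp(-(s + SIGMA) * TAU))

def zero_pair(s, s1):
    """Antiresonant factor (s - s1)(s - s1*), with the constant A set to 1."""
    return (s - s1) * (s - s1.conjugate())

def cascade(s, s1):
    """Antiresonant section followed by the feedback circuit."""
    return zero_pair(s, s1) * comb(s)

S1 = complex(-SIGMA, math.pi / TAU)    # lowest pole, at 500 cycles per second
LIMIT = 2j * S1.imag / TAU             # value the cascade approaches at s1

if __name__ == "__main__":
    near = S1 + 1.0e-3                 # a point very close to the pole
    print("feedback circuit alone:", abs(comb(near)))
    print("with antiresonant zero:", abs(cascade(near, S1)), "limit", abs(LIMIT))
```

The same check applies to the other unwanted poles by substituting their values for `S1`.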

Turning now to FIG. 2, the synthesized speech spectrum from synthesizer 12 is applied through a sufficiently high resistance $R_3$ to pass a constant current through the series connected inductance, resistance and capacitance elements respectively denoted $L_1$, $R_1$, and $C_1$, and to apply a constant voltage to cathode follower $V_1$. The output voltage of cathode follower $V_1$ is differentiated by capacitance element $C_2$ and resistance element $R_2$, and the differentiated output signal is passed to the next antiresonant circuit.

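The impedance of elements $L_1$, $R_1$, $C_1$, in Laplace transform notation, may be written

$$sL_1 + R_1 + \frac{1}{sC_1} = \frac{L_1}{s}\left(s^2 + \frac{R_1}{L_1}s + \frac{1}{L_1 C_1}\right), \qquad (7)$$

and differentiating elements $R_2$ and $C_2$ change this impedance by a multiplicative factor $s$, so that the transfer characteristic of circuit 131-1 may be written

$$Z_1 = A\left(s^2 + \frac{R_1}{L_1}s + \frac{1}{L_1 C_1}\right). \qquad (8)$$

From Equation 8 it is evident that the values of the inductance, resistance, and capacitance elements $L_1$, $R_1$, and $C_1$ may be determined from the predetermined values of $\sigma$ and $\omega_1$ according to the following relations:

$$\frac{R_1}{L_1} = 2\sigma, \qquad \frac{1}{L_1 C_1} = \sigma^2 + \omega_1^2. \qquad (9)$$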
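A short calculation shows how component values might follow from a chosen zero location. The sketch assumes the antiresonant section has a characteristic proportional to $s^2 + (R_1/L_1)s + 1/(L_1 C_1)$, so that matching it to zeros at $-\sigma \pm j\omega_1$ gives $R_1 = 2\sigma L_1$ and $C_1 = 1/(L_1(\sigma^2 + \omega_1^2))$. The inductance of 1 henry and the function names are hypothetical, chosen only to make the arithmetic concrete.

```python
import cmath
import math

def rlc_values(sigma, omega, inductance):
    """Return (R, C) so that s^2 + (R/L)s + 1/(LC) has zeros at -sigma +/- j*omega."""
    resistance = 2.0 * sigma * inductance
    capacitance = 1.0 / (inductance * (sigma ** 2 + omega ** 2))
    return resistance, capacitance

def zeros_of_section(inductance, resistance, capacitance):
    """Roots of s^2 + (R/L)s + 1/(LC), found with the quadratic formula."""
    b = resistance / inductance
    c = 1.0 / (inductance * capacitance)
    disc = cmath.sqrt(b * b - 4.0 * c)
    return (-b + disc) / 2.0, (-b - disc) / 2.0

if __name__ == "__main__":
    sigma = 400.0                  # nepers per second
    omega = math.pi / 1.0e-3       # radian frequency of the 500 cycle pole
    L1 = 1.0                       # hypothetical inductance in henries
    R1, C1 = rlc_values(sigma, omega, L1)
    print("R1 (ohms):", R1)
    print("C1 (farads):", C1)
    print("zeros:", zeros_of_section(L1, R1, C1))
```

Any other inductance may be chosen; the resistance scales with it and the capacitance scales inversely, leaving the zero locations unchanged.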

Although this invention has been described in terms of a resonance vocoder system of the type shown in FIG. 1, it is to be understood that applications of the principles of this invention are not limited to this particular system, but include other resonance vocoder systems as well as various kinds of speech processing equipment in which speech formants are synthesized. In addition, it is to be understood that the above-described embodiments are merely illustrative of the numerous arrangements that may be devised for the principles of this invention by those skilled in the art without departing from the spirit and scope of the invention.

What is claimed is:

1. In a resonance vocoder synthesizer,

a source of a plurality of control signals including a pitch control signal, an amplitude control signal, and a group of formant control signals representative of the frequencies of selected low frequency formant peaks in the spectrum of an original speech wave,

first synthesizing means responsive to said plurality of control signals for developing an artificial speech spectrum having a first group of peaks at frequencies represented by said formant control signals, and

second synthesizing means for shaping said artificial speech spectrum to have a second group of peaks at selected fixed frequencies representative of high frequency speech formants, said second synthesizing means including

a resonant circuit having a transfer characteristic with no zeros and an infinite number of poles at equally spaced predetermined frequencies, wherein the higher frequency poles of said transfer characteristic correspond in frequency to high frequency speech formants, and

a plurality of series-connected antiresonant circuits preceding said resonant circuit, each of said antiresonant circuits having a transfer characteristic with no poles and a zero at a predetermined frequency corresponding to an unwanted pole in said transfer characteristic of said resonant circuit, and

means for applying said artificial speech spectrum to said second synthesizing means.

2. In combination with a resonance vocoder synthesizer that generates an artificial speech spectrum having peaks at selected low frequencies corresponding to selected low frequency formant peaks in the spectrum of an original speech wave, apparatus for introducing additional peaks into said artificial speech spectrum at selected frequencies corresponding to selected high frequency formants which comprises

a resonant circuit having a transfer characteristic with no zeros and an infinite number of poles at equally spaced fixed frequencies, wherein the higher frequency poles of said transfer characteristic correspond in frequency to selected high frequency speech formants, and

a plurality of series-connected antiresonant circuits in preceding circuit relation with said resonant circuit for cancelling a corresponding plurality of unwanted poles in the transfer characteristic of said resonant circuit.

3. Apparatus for synthesizing a plurality of peaks of predetermined width at selected fixed frequencies in an incoming amplitude spectrum which comprises

a plurality of series-connected antiresonant circuits each provided with a transfer characteristic having a single zero at $s = -\sigma \pm j\,n\pi/\tau$, ($n = 1, 3, \ldots$),

means for applying said incoming amplitude spectrum to said input point of said plurality of antiresonant circuits, and

means for connecting the output point of said plurality of antiresonant circuits to said resonant circuit.

4. Apparatus for synthesizing an artificial spectrum having a plurality of peaks at selected high frequency locations so that said peaks in said artificial spectrum closely resemble the formant peaks at high frequency locations in the spectrum of an original speech wave, which comprises

means for developing an incoming artificial spectrum having peaks at selected low frequency locations corresponding to formant peaks at selected low frequency locations in said spectrum of said original speech wave,

a plurality $M$ ($M = 1, 2, \ldots$) of series-connected antiresonant circuits for preventing the occurrence of peaks at a corresponding plurality of unwanted high frequency locations, wherein the nth of said antiresonant circuits, $n = 1, 2, \ldots, M$, comprises

an input terminal,

a cathode follower provided with an input point and an output point,

a first resistance element connected between said input terminal and said input point of said cathode follower,

an inductance element $L_n$, a second resistance element $R_n$, and a first capacitance element $C_n$ connected in series between said input point of said cathode follower and ground,

an output terminal,

a second capacitance element connected between said output point of said cathode follower and said output terminal, and

a third resistance element connected between said output terminal and ground,

wherein the values of said inductance element $L_n$, said second resistance element $R_n$, and said first capacitance element $C_n$ are determined by the bandwidth $\sigma$ and the frequency $\omega_n$ of said unwanted peak at said high frequency location according to the following relationships:

$$\frac{R_n}{L_n} = 2\sigma, \qquad \frac{1}{L_n C_n} = \sigma^2 + \omega_n^2,$$

means for applying said incoming artificial spectrum to said input terminal of the first of said antiresonant circuits, $n = 1$,

a resonant circuit including subtracting means provided with a minuend terminal,

a subtrahend terminal and an output terminal,

a delay element having a delay time of $\tau$ seconds and provided with an input terminal and an output terminal,

an amplifier having a gain $e^{-\sigma\tau}$ so that said resonant circuit has an infinite number of resonances at frequencies $\omega_i$, said amplifier being provided with an input terminal and an output terminal,

means for connecting said output terminal of the last of said antiresonant circuits, $n = M$, to said minuend terminal of said subtracting means,

means for connecting said output terminal of said subtracting means to said input terminal of said delay element,

means for connecting said output terminal of said delay element to said input terminal of said amplifier, and

means for connecting said output terminal of said amplifier to said subtrahend terminal of said subtracting means,

thereby developing from said incoming artificial spectrum an outgoing artificial spectrum at said output terminal of said delay element, wherein said outgoing artificial spectrum has peaks at selected low frequency locations corresponding to said peaks of said incoming artificial spectrum and peaks at selected high frequency locations corresponding to resonances of said resonant circuit at frequencies $\omega_i$, for $i$ greater than $M$.

No references cited.

KATHLEEN H. CLAFFY, Primary Examiner.

R. MURRAY, Assistant Examiner.
